Red Wine Exploration by Piranut Lapprathana
Univariate Plots
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The mean quality score is about 5.6 and the median is 6. The median density is 0.997 g/cm^3. 75% of wines have a pH of 3.4 or lower (most wines are between 3 and 4 on the pH scale).
Exploratory, quick histogram plots
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
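The `stat_bin()` message above goes away once a bin width is chosen explicitly. A minimal sketch, assuming ggplot2 is installed; the data frame `rw_head` is a hypothetical name holding only the first ten alcohol values from the structure listing, and the bin width of 0.1 is an illustrative choice rather than the one used for the plots in this report:

```r
library(ggplot2)

# First ten alcohol values from the structure listing (illustrative subset only).
rw_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Setting binwidth explicitly replaces the default of 30 bins.
p <- ggplot(rw_head, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1)

print(p)
```

A narrower bin width shows more detail but produces a noisier histogram, so the value is worth tuning per variable.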

Distribution of wine quality

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Distribution of residual.sugar
## Warning: Removed 11 rows containing non-finite values (stat_bin).

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Distribution of citric acid
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
Distribution of volatile acidity

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
No wine has a volatile acidity of 0; the minimum is 0.12 and the maximum is 1.58.
##
## 0.12 0.16 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27
## 3 2 10 2 3 6 6 5 13 7 16 14
## 0.28 0.29 0.295 0.3 0.305 0.31 0.315 0.32 0.33 0.34 0.35 0.36
## 23 16 1 16 2 30 2 23 20 30 22 38
## 0.365 0.37 0.38 0.39 0.395 0.4 0.41 0.415 0.42 0.43 0.44 0.45
## 2 24 35 35 2 37 33 3 31 43 23 22
## 0.46 0.47 0.475 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.545 0.55
## 31 21 2 24 35 46 24 33 29 31 5 20
## 0.56 0.565 0.57 0.575 0.58 0.585 0.59 0.595 0.6 0.605 0.61 0.615
## 34 1 28 3 38 3 39 1 47 3 27 6
## 0.62 0.625 0.63 0.635 0.64 0.645 0.65 0.655 0.66 0.665 0.67 0.675
## 24 3 29 9 27 12 16 7 26 3 23 3
## 0.68 0.685 0.69 0.695 0.7 0.705 0.71 0.715 0.72 0.725 0.73 0.735
## 12 11 23 7 10 6 3 12 5 9 6 8
## 0.74 0.745 0.75 0.755 0.76 0.765 0.77 0.775 0.78 0.785 0.79 0.795
## 11 5 6 3 5 5 6 4 10 8 2 2
## 0.8 0.805 0.81 0.815 0.82 0.825 0.83 0.835 0.84 0.845 0.85 0.855
## 3 1 2 3 5 1 4 4 8 1 2 3
## 0.86 0.865 0.87 0.875 0.88 0.885 0.89 0.895 0.9 0.91 0.915 0.92
## 2 1 4 2 5 5 1 1 3 3 4 1
## 0.935 0.95 0.955 0.96 0.965 0.975 0.98 1 1.005 1.01 1.02 1.025
## 2 1 1 3 3 1 3 3 1 1 4 1
## 1.035 1.04 1.07 1.09 1.115 1.13 1.18 1.185 1.24 1.33 1.58
## 1 3 1 1 1 1 1 1 1 2 1
Density of wine
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Chlorides

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Categorizing wines into three ratings based on quality: ‘bad’ (0-4), ‘average’ (5-6) and ‘good’ (7-10)
## bad average good
## 63 1319 217
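One way to build such a rating is with base R's `cut()`. This is a sketch under the stated bin edges, not necessarily the code used for this report; the vector `quality` below is toy data that covers the observed range of scores:

```r
# Toy quality scores covering the observed range (3-8); not the real data.
quality <- c(3, 4, 5, 6, 7, 8)

# Bins: [0, 4] -> bad, (4, 6] -> average, (6, 10] -> good.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"),
              include.lowest = TRUE)

# Count of wines per rating.
table(rating)
```

Because `cut()` returns a factor with the labels in bin order, the table and any plots faceted by rating keep the bad, average, good ordering.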
Combining all acids (fixed, volatile (acetic acid) and citric acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.270 7.827 8.720 9.118 10.070 17.050
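The combined acidity is a plain row-wise sum of the three acid columns. A small sketch using the first four rows shown in the structure listing; the data frame name `acids` is hypothetical:

```r
# First four rows of each acid column, copied from the structure listing.
acids <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Row-wise sum of the three acids.
acids$total.acidity <- with(acids,
                            fixed.acidity + volatile.acidity + citric.acid)

acids$total.acidity
```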
Univariate Analysis
Structure of dataset:
There are 1599 wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The thirteenth column, X, is a row index.
Quality is scored from 0 (worst) to 10 (best).
The mean quality score is about 5.6 and the median is 6 (values range from 3 to 8). The median density is 0.997 g/cm^3. 75% of wines have a pH of 3.4 or lower (most wines are between 3 and 4 on the pH scale).
Main feature(s) of interest:
Volatile acidity and residual sugar are the main features of interest. I want to find out how each of them correlates with wine quality. Other features, such as chlorides and density, may also help explain quality.
Other features that may support the investigation into the feature(s) of interest:
chlorides, density, pH and alcohol
New variables created:
I created a ‘rating’ variable which classifies wines into ‘bad’, ‘average’ and ‘good’. I also created a variable called ‘total.acidity’ to hold the sum of all acids (fixed, volatile, and citric).
Bivariate Plots
Correlation between alcohol and density 
##
## Pearson's product-moment correlation
##
## data: rw$alcohol and rw$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
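The t statistic and confidence interval in this output follow directly from the sample correlation and the sample size. The sketch below recomputes them from the reported values using the standard formulas (t with n - 2 degrees of freedom, and a Fisher z-transformation for the interval); the variable names are illustrative:

```r
r <- -0.4961798  # sample correlation reported above
n <- 1599        # number of wines (df = n - 2 = 1597)

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The results should agree with the reported t = -22.838 and the interval from -0.532 to -0.458, up to rounding.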
Quality and volatile.acidity

## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Higher-quality wines have lower volatile acidity: the median falls from 0.845 at quality 3 to 0.37 at quality 7 and 8 (high levels of volatile acidity can lead to an unpleasant, vinegary taste).
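Per-quality summaries like the ones above can be produced with `by()`, and `tapply()` gives just the medians for a quicker comparison. A sketch on the first ten rows from the structure listing; `rw_head` is a hypothetical name and ten rows are far too few to reproduce the figures above:

```r
# First ten rows of quality and volatile.acidity from the structure listing.
rw_head <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70,
                       0.66, 0.60, 0.65, 0.58, 0.50)
)

# Six-number summary per quality level.
by(rw_head$volatile.acidity, rw_head$quality, summary)

# Medians only, one per quality level.
med <- tapply(rw_head$volatile.acidity, rw_head$quality, median)
med
```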
Quality and alcohol

## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Higher-quality wines tend to have a higher percentage of alcohol, although the trend is not strictly monotonic (quality 5 has the lowest median, 9.7). Quality 8 has the highest median alcohol content (12.15).
Generating box plots for each variable against quality 

Calculating correlations against quality for each variable
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## total.acidity log10.residual.sugar log10.chlordies
## 0.10375373 0.02353331 -0.17613996
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH log10.sulphates alcohol
## -0.05773139 0.30864193 0.47616632
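A vector of correlations like this one can be built by applying `cor()` to each predictor in turn. The sketch below assumes a log10 transform for sulphates, as the variable names above suggest, and uses only the first ten rows from the structure listing, so its values will not match the full-data correlations; `rw_head` and `predictors` are hypothetical names:

```r
# First ten rows of a few columns from the structure listing.
rw_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70,
                       0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56,
                       0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Log-transform the skewed variable before correlating (assumed transform).
rw_head$log10.sulphates <- log10(rw_head$sulphates)

# Pearson correlation of each predictor with quality.
predictors <- c("volatile.acidity", "log10.sulphates", "alcohol")
cors <- sapply(predictors, function(v) cor(rw_head[[v]], rw_head$quality))
cors
```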
Correlation
##
## Pearson's product-moment correlation
##
## data: rw$alcohol and rw$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
##
## Pearson's product-moment correlation
##
## data: rw$residual.sugar and rw$density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3116908 0.3973835
## sample estimates:
## cor
## 0.3552834
##
## Pearson's product-moment correlation
##
## data: rw$citric.acid and rw$pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5756337 -0.5063336
## sample estimates:
## cor
## -0.5419041
##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
Correlations between variables faceted by rating 

##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034

##
## Pearson's product-moment correlation
##
## data: rw$volatile.acidity and rw$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957

##
## Pearson's product-moment correlation
##
## data: log10(rw$total.acidity) and rw$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7283140 -0.6788653
## sample estimates:
## cor
## -0.7044435
As expected, there is a strong negative correlation (-0.704) between log10 of total acidity and pH (higher acidity means a lower pH value).
Bivariate Analysis
Relationships observed between feature(s) of interest and other features in the dataset:
Of the variables examined, quality correlates most strongly with alcohol (0.476) and volatile acidity (-0.391); both correlations are moderate. As wine quality increases, volatile acidity decreases (high levels of volatile acidity can lead to an unpleasant, vinegary taste). As wine quality increases, the percentage of alcohol tends to increase (quality 8 has the highest median alcohol content). The highest-quality wines have the lowest median density and the lowest-quality wines have the highest median density.
Most wines have residual sugar between 1 and 5 g/dm^3, but there are some outliers at quality 5 and 6.
Interesting relationships between the other features:
The following variables have relatively higher correlation with quality:
- volatile.acidity
- log10.sulphates
- alcohol
- citric.acid
There is a moderate negative correlation (-0.496) between alcohol and density, and a weak positive correlation (0.355) between residual sugar and density. There is a moderate negative correlation (-0.542) between citric acid and pH, as expected, since higher acidity gives a lower value on the pH scale. Fixed acidity and pH show a stronger negative correlation (-0.683), and log10 of total acidity and pH a slightly stronger one still (-0.704).
Strongest relationships:
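Quality correlates most strongly with alcohol (0.476) and volatile acidity (-0.391). Among the other features, the strongest relationships are the negative correlation between pH and total acidity (-0.704), the positive correlation between citric acid and fixed acidity (0.672), and the negative correlation between citric acid and volatile acidity (-0.552).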
Multivariate Plots
Examining four variables (alcohol, volatile acidity, pH and sulphates), three of which are among those most correlated with quality
Relationship between citric acid and volatile acidity by quality 



##
## Pearson's product-moment correlation
##
## data: rw$free.sulfur.dioxide and rw$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
Strong positive correlation between free sulfur dioxide and total sulfur dioxide (0.67)
Multivariate Analysis
Relationships observed:
Four features (citric acid, volatile acidity, pH and alcohol) are examined in relation to quality; all but pH (-0.058) are among the variables most correlated with it. I faceted the plots by rating to separate the scatterplots into three categories: bad, average, and good. Good wines tend to have higher citric acid and lower volatile acidity. Bad wines have lower levels of citric acid than average and good wines. This suggests that quality is associated with the type of acid present, not only with the total amount.
Final Plots and Summary
Plot One

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
This histogram shows that volatile acidity is roughly normally distributed, with a long right tail reaching 1.58. The middle 50% of wines have volatile acidity between 0.39 and 0.64 g/dm^3.
Plot Two

These boxplots show how different variables relate to wine quality. Good-quality wines have lower volatile acidity and pH, and higher alcohol and citric acid. The outliers in each plot suggest that quality is affected by several factors, not by any single variable.
Plot Three
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "rating" "total.acidity"

High-quality wines have higher levels of citric acid and lower levels of volatile acidity than low-quality wines.